How we tested and rated AI-generated dance videos


By Mohammed Al-Eu, Harry Johnson and Levi Sumagasai, CalMatters

Zion Harris, center, rehearses for Jeté, a monthly dance showcase at Heart WeHo in West Hollywood in Los Angeles, on September 19, 2024. Photo by Alicia Jucevic for CalMatters

This story was originally published by CalMatters. Sign up for their newsletters.

AI models can create realistic videos with a simple text prompt. But these tools still struggle to generate realistic videos of complex natural movements, such as human dance.

When CalMatters and The Markup asked dancers and choreographers whether AI could disrupt their industry, most concluded that human dancers could not be replaced.

Read our story: Our video tests prove that generative AI still sucks at dancing. See for yourself

For the most part, we found them to be right. We tested nine different cultural, modern and popular dance styles across four commercially available AI video generation models, generating a total of 36 videos. We found that the latest models produce convincingly realistic videos of people dancing—but none produced a figure performing the specific dance we prompted.

About a third of the generated videos showed inconsistencies in the subject’s appearance from frame to frame, along with abnormalities in movement and limbs. Still, the frequency and magnitude of the problems we observed represent a significant improvement over our initial testing in late 2024.

Methodology

Define a task

CalMatters and The Markup tested four commercial video generation models produced by major technology companies to create videos of traditional and popular dances.

We limited our tests to consumer-oriented, closed-source generative video tools because they are the most readily available to everyday users and generally perform better than open-source models. We tested Sora 2 by OpenAI, Veo 3.1 by Google, Kling 2.5 by Kuaishou and Hailuo 2.3 by MiniMax.

Prepare prompts

We wrote nine prompts testing different dances in a variety of settings, such as dance floors, stages, bedrooms, studios, cultural events, public squares and classrooms. We tested popular, modern and traditional cultural dance styles, including the Macarena, the Mashed Potato, folkloric dances and popular TikTok dances. See the Appendix for more details.

We varied the level of specificity to test whether identifying the dance by name was sufficient to generate a video of the desired movement or whether explicitly specifying the exact physical movements improved the result.

Before finalizing the list of prompts, we sent them to ChatGPT for edits based on OpenAI’s Sora 2 prompting guide. See Limitations: Prompt optimization for more details.

Send prompts to generate video

Each prompt was sent once, using each model’s default settings, to generate landscape-oriented videos. Three prompts sent to Sora 2 were edited to remove words that triggered OpenAI’s filter, which blocks prompts that may violate guardrails around “similarity to third-party content.” For example, Sora 2 flags prompts that name specific years, popular music artists, and other restricted terms. One blocked prompt was for a video of a politician dancing the Macarena; replacing “politician in a suit” with “man in a suit” got the prompt past the filter. Veo 3.1 flagged similar prompts when we sent them through Gemini or Flow, but not when we sent them directly to the Veo 3.1 API.

Rate the generated videos

We evaluated the generated videos on six criteria related to prompt alignment and video consistency:

  1. Was the main subject dancing in any way?
  2. Did the main subject perform the particular dance we prompted?
  3. Did the main subject maintain the same physical appearance throughout the video?
  4. Did the main subject move realistically, in line with human physiology?
  5. Did the scene and setting match the prompt?
  6. Did the camera match the prompted camera angle and position?

Each of the above criteria was scored as pass or fail by a single reviewer, with the assistance of a second reviewer when necessary. The generated cultural dance videos were reviewed for accuracy by dancers familiar with them.
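The pass/fail rubric above lends itself to a simple per-model tally. As a minimal sketch, the snippet below shows how such ratings could be aggregated; the criterion names and sample ratings are illustrative assumptions, not the article’s actual data.

```python
# Hypothetical sketch of tallying pass/fail ratings across the six criteria.
# The sample ratings below are made up for illustration, not real results.
from collections import defaultdict

CRITERIA = [
    "subject_dancing",
    "correct_dance",
    "consistent_appearance",
    "realistic_movement",
    "scene_matches_prompt",
    "camera_matches_prompt",
]

# Each rating: (model, prompt, {criterion: True for pass / False for fail})
ratings = [
    ("Sora 2", "macarena",
     dict.fromkeys(CRITERIA, True) | {"correct_dance": False}),
    ("Kling 2.5", "bird dance",
     dict.fromkeys(CRITERIA, True) | {"subject_dancing": False,
                                      "correct_dance": False}),
]

def pass_rates(ratings):
    """Return each model's fraction of criteria passed across its videos."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for model, _prompt, scores in ratings:
        for criterion in CRITERIA:
            total[model] += 1
            passed[model] += int(scores[criterion])
    return {model: passed[model] / total[model] for model in total}

print(pass_rates(ratings))
```

With the sample data above, each video contributes six binary scores, so a model’s rate is simply passes divided by six times its video count.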

Results

Of the 36 videos generated, all but one showed a dancing figure. The one video that did not—produced by Kling 2.5—instead featured the lower half of a figure performing side kicks.

None of the videos performed the actual dance we prompted. Of the videos of the Native American bird dance, Emily Clark, a member of the Cahuilla Band, said, “In my opinion, none of these images come close to the bird dance.” Videos of the Horton dance did not show the specific move we prompted, but choreographer Emma Andre said she found Veo 3.1’s rendering “amazingly lifelike.”

For the rest of the pop culture dances, we compared the generated videos to videos we found on YouTube to judge whether the dance was accurate.

11 of the 36 videos showed problems with movement or appearance consistency. These included sudden changes in the appearance of clothing, hair or limbs, as well as abnormalities such as heads rotating on a separate axis from their bodies and limbs liquefying and regenerating.

See the Appendix for full results and videos.

Limitations

Image-to-video generation

We did not prompt the models with images. Image-to-video generation involves uploading a static image along with a text prompt, producing a dynamic video from the two. It is an advertised use case for these models, which can produce dance videos from user-submitted images.

Multi-dancer videos

We did not ask for videos with multiple dancers, although some of the dances are often performed in groups. We limited our video prompts to show a single dancer to avoid ambiguity as to whether a failing score was due to problems generating complex human motion or a realistic multi-subject video.

Prompt optimization

We did not optimize prompts for each model. Each company publishes its own prompting guide. (See the guides for Veo 3.1, Hailuo 2.3, Kling 2.5 and Sora 2.) Instead, we used ChatGPT 5 to standardize prompts across models, aligning them with OpenAI’s Sora 2 prompting guide. Optimizing prompts according to each model’s specific guide may have produced more accurate results.

We also tried to improve the quality of the videos by giving detailed step-by-step instructions for each dance. However, these instructions did not produce videos that were more accurate than those created with simpler prompts.

Models for generating human movements

We did not test generative models focused on human motion. These models are used in animation and video games to generate natural human movement. Researchers are training some cutting-edge academic models in this space on large datasets that include footage of popular dances on TikTok. Although these models may perform better than the consumer-oriented models we tested, they require technical expertise and significant computational resources to run.

Sample size

Our evaluation is limited to the videos generated from nine prompts; it is not a comprehensive assessment of the models. Some video generation benchmarks, such as those from Tencent’s AI Lab and others, use several hundred prompts to test capabilities such as complex movement, multiple objects and creative styles.

Acknowledgments

We thank Yuhang Yang (University of Science and Technology of China) and Xiaodong Cun (Great Bay University) for reviewing an early draft of this methodology.

Appendix

Review the ratings by prompt or by model.

Prompt ratings
